Analysis of Features Influencing Red Wine Quality by Chengyu Hang

Univariate Plots Section

Let’s first plot the histogram of fixed acidity.

Fixed acidity seems to display a normal distribution. Let’s look at the distribution of volatile acidity.

Volatile acidity looks closer to a normal distribution after taking the log of its values.

Let’s look at the distributions of more features. From the plots above, the following observations are made:

The histogram is highly skewed to left.

Quality ranges from 3 to 8. Most wines are of medium quality (5 to 6).

Most of the wines fall in the range of 4 to 6 in terms of quality.

Univariate Analysis

What is the structure of your dataset?

There are 1,599 red wines in this data set with 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality).

What is/are the main feature(s) of interest in your dataset?

The main features in the data set are alcohol and quality. I’d like to determine which features are best for predicting the quality of wine. I think alcohol, the quantity of SO2 (free and total) and acidity (both fixed and volatile) might be useful in a predictive model of wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Alcohol, the quantity of SO2 (free and total) and acidity (both fixed and volatile).

Did you create any new variables from existing variables in the dataset?

Yes, quality.level is the variable added to the dataset which distributes the sample into 3 quality bins (0,4], (4,6] and (6,10].
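The binning behind quality.level can be sketched as follows. This is a minimal Python illustration, not the R code used for this report; the names EDGES, LABELS and quality_level are hypothetical, and only the three right-closed intervals come from the text above.

```python
import bisect

# Right-closed bin edges taken from the text: (0,4], (4,6], (6,10].
EDGES = [0, 4, 6, 10]
LABELS = ["(0,4]", "(4,6]", "(6,10]"]


def quality_level(score):
    """Return the right-closed bin label for a quality score (hypothetical helper)."""
    if not EDGES[0] < score <= EDGES[-1]:
        raise ValueError(f"score {score} is outside (0, 10]")
    # bisect_left finds the first edge >= score, so a score equal to an
    # edge falls into the bin that ends at that edge (right-closed bins).
    return LABELS[bisect.bisect_left(EDGES, score) - 1]


if __name__ == "__main__":
    for s in (3, 4, 5, 6, 7, 8):
        print(s, quality_level(s))
```

Because the intervals are closed on the right, a score of 4 belongs to the lowest bin and a score of 6 to the middle bin.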

Of the features you investigated, were there any unusual distributions?

According to the plots above, there are outliers in some of the features, such as SO2 (free and total) and acidity (fixed and volatile). The distribution of volatile acidity also appears to be bimodal. After a log transformation, however, the distribution looks normal.
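One way to check whether a log transformation makes a distribution more symmetric is to compare the sample skewness before and after. The sketch below is in Python rather than the R used for this report, and it assumes a base-10 log and a moment-based skewness estimate. The values are made up for illustration and are not taken from the wine data.

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (0 for a symmetric sample)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up, right-skewed illustration values (NOT from the wine data set).
raw = [0.2, 0.3, 0.4, 0.5, 0.6, 1.2]
logged = [math.log10(v) for v in raw]

if __name__ == "__main__":
    print("skewness before log:", round(skewness(raw), 3))
    print("skewness after log: ", round(skewness(logged), 3))
```

A skewness closer to zero after the transformation supports what the log-scaled histogram suggests visually, although it does not prove normality on its own.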

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

No

Bivariate Plots Section

From the correlation matrix, the following behaviors are observed:

1.Fixed Acidity shows significant negative correlation with pH and volatile acidity.

2.Volatile Acidity is highly negatively correlated with citric acid and quality.

3.Free SO2 shows significant positive correlation with total SO2.

4.Density shows significant negative correlation with alcohol and pH, and positive correlation with fixed acidity and citric acid.

5.Quality is positively correlated with alcohol and negatively correlated with volatile acidity.

Also, from the scatterplot matrix above, chlorides and sulphates don’t seem to have any effect on quality.
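The correlations discussed above can be illustrated with a small Pearson coefficient calculation. This Python sketch assumes the correlation matrix used Pearson's r, which the report does not state. The function name pearson_r is hypothetical and the sample values are made up, not taken from the data set.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares of the deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up illustration values (NOT from the wine data set).
alcohol = [9.4, 9.8, 10.5, 11.2, 12.0]
volatile_acidity = [0.70, 0.66, 0.50, 0.40, 0.30]
quality = [5, 5, 6, 6, 7]

if __name__ == "__main__":
    print("alcohol vs quality:         ", round(pearson_r(alcohol, quality), 3))
    print("volatile acidity vs quality:", round(pearson_r(volatile_acidity, quality), 3))
```

The sign of the coefficient gives the direction of the relationship and its magnitude (between 0 and 1) gives the strength of the linear association.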

Let’s have some box plots with quality level to observe the outliers.

For pH, most of the outliers seem to lie in the quality range (4,6].

For alcohol, most of the outliers also seem to lie in the quality range (4,6].

Only a few outliers are observed for citric acid.

SO2 has outliers at every quality level.
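Box plots conventionally flag points lying more than 1.5 times the interquartile range beyond the quartiles. The Python sketch below applies that rule, assuming it is the rule behind the plots above. The name iqr_outliers and the sample values are hypothetical, and the quartile method may differ slightly from the hinges that R's box plots use.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between the order statistics.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    # Made-up illustration values (NOT from the wine data set).
    sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
    print(iqr_outliers(sample))
```

Running this per quality level would show how many flagged points each level contributes.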

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  • Fixed acidity and citric acid are significantly correlated.
  • Alcohol content for quality less than 6 seems to be higher.
  • Volatile Acidity is higher for quality levels more than 4.
  • Wine samples with less density have high alcohol content.
  • Residual sugar is not useful for classifying the quality of wine.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Chlorides and sulphates do not exhibit any significant relationship with any other feature. Also, most of the outliers are in the quality range (4,6], which is not good for prediction models.

What was the strongest relationship you found?

  • Positive :
    • Fixed acidity - density
    • Free SO2 - total SO2
  • Negative :
    • Volatile acidity - citric acid

Multivariate Plots Section

Let’s now dig deeper into the correlation between quality and other features:

There seems to be no significant bias in alcohol content, except that some samples with higher alcohol content show a higher density reading at quality levels 3 and 5.

The negative correlation between volatile acidity and quality is summarized below:

Wines with higher volatile acidity seem to show higher density at quality levels 5, 7 and 8.

Let’s look at the relationship between residual sugar and quality.

Most quality ratings show a higher density of residual sugar (quality = 3 is a little lower), but no significant pattern is observed, so residual sugar would not help predict quality.

##   quality.level Mean_Alcohol Median_Alcohol
## 1         (0,4]     10.21587           10.0
## 2         (4,6]     10.25272           10.0
## 3        (6,10]     11.51805           11.6
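A per-level summary like the table above can be produced by grouping the alcohol values by quality level and taking the mean and median of each group. The Python sketch below shows the idea; it is not the R code that generated the table. The name summarize_by_level is hypothetical and the rows are made up, so the numbers will not match the table.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by_level(rows):
    """rows: iterable of (quality_level, alcohol) pairs -> {level: {"mean", "median"}}."""
    groups = defaultdict(list)
    for level, alcohol in rows:
        groups[level].append(alcohol)
    return {
        level: {"mean": mean(values), "median": median(values)}
        for level, values in sorted(groups.items())
    }


if __name__ == "__main__":
    # Made-up illustration rows (NOT from the wine data set).
    toy_rows = [
        ("(4,6]", 9.5), ("(4,6]", 10.0), ("(4,6]", 10.5),
        ("(6,10]", 11.0), ("(6,10]", 12.0),
    ]
    for level, stats in summarize_by_level(toy_rows).items():
        print(level, stats)
```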

Good wines concentrate when citric acid is more than 0.3 and alcohol is more than 10.5. That is, if we have certain levels of both then we have higher quality.
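The two thresholds above can be written as a simple rule to count how many samples fall in the region. In this Python sketch the function names and sample values are hypothetical, and the strict inequalities follow the wording "more than". Only the cut-offs 0.3 and 10.5 come from the text.

```python
def in_good_region(citric_acid, alcohol, citric_min=0.3, alcohol_min=10.5):
    """True when both values exceed the thresholds quoted in the text."""
    return citric_acid > citric_min and alcohol > alcohol_min


def share_in_region(samples):
    """Fraction of (citric_acid, alcohol) pairs that satisfy the rule."""
    hits = sum(1 for citric, alcohol in samples if in_good_region(citric, alcohol))
    return hits / len(samples)


if __name__ == "__main__":
    # Made-up illustration values (NOT from the wine data set).
    toy = [(0.45, 11.8), (0.10, 9.4), (0.35, 10.2), (0.50, 12.1)]
    print(share_in_region(toy))
```

Comparing this share between the (6,10] level and the lower levels would quantify how strongly good wines concentrate in that region.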

A negative correlation is observed here. Most wine samples with quality 5 have an alcohol content below 11% by volume, while most samples with quality 7 are above 11% alcohol by volume.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Good wines concentrate when citric acid is more than 0.3 and alcohol is more than 10.5.

Were there any interesting or surprising interactions between features?

Good wines concentrate when citric acid is more than 0.3 and alcohol is more than 10.5. That is, if we have certain levels of both then we have higher quality.


Final Plots and Summary

Plot One

Description One

Before the analysis, I thought residual sugar would play an important role in defining the quality of wine (which it does). However, because it is significant at every level of wine quality, it does not actually help me determine the quality level.

Plot Two

Description Two

Good wines concentrate when citric acid is more than 0.3 and alcohol is more than 10.5. That is, if we have certain levels of both then we have higher quality.

Plot Three

Description Three

A negative correlation is observed between alcohol, density and quality. Density and alcohol also show the strongest correlation among all wine parameters.

Reflection

The data set contains information on 1,599 wine samples across 12 variables. In the initial phase, I examined individual variables (univariate analysis), which raised interesting questions and observations. I then explored wine quality across multiple variables (bivariate and multivariate analysis).

Many other factors are related to good wines. Many of them relate to smells and flavours rather than to the chemical properties and gustative perceptions recorded in our dataset. Although our variables explain part of what we see, we have also seen some cases where there must be other explanations for high or low quality levels.

One of the major challenges in this analysis was the limitations of the dataset. The variable of interest, wine quality, was an integer value measured on a scale of 0 to 10. However, the vast majority of the wines (1,319 out of 1,599) received a score of 5 or 6. Only 63 wines received a score of 3 or 4, and 217 wines received a score of 7 or 8. No wines received scores of 0, 1, 9, or 10. Since the wine quality variable had such limited variability, it was difficult to assess the relationship between quality and the chemical attribute variables. Having a greater variety of quality ratings or having finer gradations in the quality ratings might have allowed for a more nuanced analysis.
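The imbalance described above can be checked with a few lines of arithmetic. This Python sketch uses only the counts quoted in the paragraph; the variable names are illustrative.

```python
# Counts quoted in the text, grouped by quality score.
counts = {"3-4": 63, "5-6": 1319, "7-8": 217}

total = sum(counts.values())
# Share of each group as a percentage, rounded to one decimal place.
shares = {group: round(100 * n / total, 1) for group, n in counts.items()}

if __name__ == "__main__":
    print("total samples:", total)
    for group, pct in shares.items():
        print(f"quality {group}: {pct}%")
```

The three groups add up to the full 1,599 samples, and the middle scores account for roughly 82% of them.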

Reference

http://www.sthda.com/english/wiki/correlation-matrix-a-quick-start-guide-to-analyze-format-and-visualize-a-correlation-matrix-using-r-software

https://medium.freecodecamp.org/using-data-science-to-understand-what-makes-wine-taste-good-669b496c67ee

https://medium.com/@jeromevonk/red-wine-quality-exploration-ea88e6b0e3c5